Automatic keyphrase extraction (AKE) is a principal task in natural
language processing (NLP). Several techniques have been exploited to
improve the process of extracting keyphrases from documents. Deep
learning (DL) algorithms are the latest techniques used in the prediction
and extraction of keyphrases. DL is one of the most complex types of machine
learning, relying on artificial neural networks to make the machine
follow a decision-making path similar to that of the human brain. In this
paper, we present a review of deep learning-based methods for AKE from
documents, to highlight their contribution to improving keyphrase extraction
performance. This review also provides researchers with a collection of
data and information on the mechanisms of deep learning algorithms in the
AKE domain, which should help them solve problems encountered by AKE
approaches and propose new methods for improving keyphrase extraction
performance.


1. INTRODUCTION

A keyphrase is an expression that identifies one of the main topics or one of the main ideas of a
document. Automatic keyphrase extraction is essential for numerous natural language processing (NLP) tasks
such as clustering [1], information retrieval [2], and text summarization [3]. Deep learning (DL) is one of the
most important solutions proposed for automatic keyphrase extraction, because DL algorithms can model
the complex relationships between a large number of interrelated variables [4].
Moreover, traditional machine learning (ML) algorithms cannot process raw input data directly, a
limitation that DL algorithms have helped overcome [5]. Recently, several DL algorithms have been proposed [6].
These algorithms have been widely used to improve many tasks, such as keyphrase extraction [7], machine
translation [8], sentiment analysis [9], question-answering systems [10], word recognition systems [11], and
recommender systems [12]. However, we found that most reviews of DL algorithms do not discuss their use
for keyphrase extraction at length; they instead cover a broad set of tasks [6], [13]. This makes it
difficult for new researchers to understand how to use DL algorithms to improve keyphrase extraction
performance.

The objective of this article is to review DL-based keyphrase extraction approaches in order to
provide an overview of the best algorithms for this task, as well as the datasets used to train and test these
approaches. This review will enable researchers to gain a better understanding of how deep learning
algorithms can be used to extract keyphrases and to propose new approaches that outperform the current ones.

The rest of the paper is organized as follows. Section 2 introduces deep learning algorithms,
especially those used in NLP tasks. Section 3 discusses deep learning-based keyphrase extraction
approaches. The empirical results of these methods are presented in section 4, followed by a general
discussion in section 5. Finally, section 6 concludes the paper and outlines future research directions.


2. DEEP LEARNING ALGORITHMS 

DL algorithms are currently among the most effective machine learning algorithms [14]. In this
section, we present the most important and well-known deep learning algorithms and their fields of
application. Most reviews, such as [6], classify these algorithms into several models:
convolutional neural networks, autoencoders, deep belief networks, recurrent neural networks (RNN),
generative adversarial networks, and deep reinforcement learning. In contrast, [15] classifies them into
supervised, unsupervised, and hybrid algorithms.


2.1. Multilayer perceptrons 

The multilayer perceptron (MLP) [16] is a neural network made up of several layers. In addition to
the input and output layers, an MLP contains one or more hidden layers. Each
layer is made up of a variable number of neurons. A neuron has inputs, which are real values denoted by
x1, ..., xn, and an output, denoted by y (see Figure 1).

Figure 1. The multilayer perceptron architecture


To solve a problem with an MLP, it is necessary to determine the best weights of the connections
between the neurons. For this, the MLP uses backpropagation [17] as a training method. This requires a
differentiable activation function such as the sigmoid, rectified linear unit (ReLU), or tanh function
(Table 1), so that the weights in the network can be adjusted iteratively to minimize the deviation
from the target output. MLPs are used in various applications such as machine translation, speech
recognition, and image classification. Moreover, the performance of some methods based on machine
learning algorithms [18], [19] can be improved by using an MLP.
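
The neuron computation and the iterative weight update described above can be sketched as follows. This is only a minimal illustration, assuming a single sigmoid neuron, a squared-error loss, and a hypothetical learning rate `lr`; the function names are ours and none of these choices are taken from the reviewed methods.

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b):
    """Output y of one neuron: activation of the weighted sum of inputs x1..xn."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_step(x, target, w, b, lr=0.5):
    """One gradient-descent step on the squared error 0.5 * (y - target)^2."""
    y = neuron_output(x, w, b)
    # Chain rule: dLoss/dz = (y - target) * sigmoid'(z), where sigmoid'(z) = y * (1 - y)
    delta = (y - target) * y * (1.0 - y)
    new_w = [wi - lr * delta * xi for wi, xi in zip(w, x)]
    new_b = b - lr * delta
    return new_w, new_b

# Toy usage: repeated updates reduce the error on this single example.
x, target = [1.0, 0.5], 1.0
w, b = [0.1, -0.2], 0.0
before = abs(neuron_output(x, w, b) - target)
for _ in range(100):
    w, b = train_step(x, target, w, b)
after = abs(neuron_output(x, w, b) - target)
```

In a full MLP the same chain-rule step is repeated layer by layer, from the output back to the input, which is what backpropagation refers to.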


2.2. Convolutional neural networks 

Convolutional neural networks (CNN) [20] are a type of artificial neural network based on the idea
that neurons in the visual cortex search for features. CNNs require much less pre-processing than other
algorithms. For example, a CNN receives an input text and learns the relevant features from the text itself.
CNNs are often used in several fields, especially classification, document analysis, computer vision tasks,
and image segmentation [21]. The architecture of a CNN consists of three main types of layers:
convolutional layers, pooling layers, and fully connected (FC) layers. Figure 2 presents this architecture.


2.2.1. The convolution phase 

In this step, a feature detector is applied to an area of the image to extract the necessary features.
To introduce non-linearity into the model, the CNN applies the ReLU activation function to the feature
matrix after each convolution. A CNN uses multiple convolutional layers, which allows the network to
build an increasingly detailed representation of the images in the dataset.


2.2.2. The pooling phase 

A second layer, called pooling, is used to group convolutional features and reduce their dimensions
for easier processing. Two types of pooling can be used: average pooling and max pooling. Convolution and
pooling together can be considered the first stage that allows a CNN to capture the features of the text
accurately; complex texts require several such stages.


2.2.3. The fully connected phase

Once the convolution and the pooling are complete, the resulting matrix must be flattened into a
vector so that it can be fed to a fully connected artificial neural network. The network is then refined by
propagating information through it until the desired state is reached.
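
The three phases above can be illustrated on a toy one-dimensional input. The sketch below is a simplification under our own assumptions: the sequence of numbers stands in for a text or image signal, the kernel values are arbitrary, and the function names are hypothetical.

```python
def relu(v):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, v)

def conv1d(seq, kernel):
    """Slide the kernel (feature detector) over the sequence, then apply ReLU."""
    k = len(kernel)
    return [relu(sum(kernel[j] * seq[i + j] for j in range(k)))
            for i in range(len(seq) - k + 1)]

def max_pool(features, size):
    """Keep the maximum of each non-overlapping window (last window may be shorter)."""
    return [max(features[i:i + size]) for i in range(0, len(features), size)]

def avg_pool(features, size):
    """Keep the mean of each non-overlapping window (last window may be shorter)."""
    windows = [features[i:i + size] for i in range(0, len(features), size)]
    return [sum(w) / len(w) for w in windows]

seq = [1.0, 2.0, -1.0, 0.0, 3.0, 1.0]
kernel = [1.0, -1.0]                 # arbitrary detector: responds to a drop in value
features = conv1d(seq, kernel)       # convolution phase
pooled = max_pool(features, 2)       # pooling phase
print(features)  # [0.0, 3.0, 0.0, 0.0, 2.0]
print(pooled)    # [3.0, 0.0, 2.0]
```

With one-dimensional input the pooled list is already a vector; with a two-dimensional feature matrix, the fully connected phase would first flatten it row by row before passing it to the dense layers.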


2.3. Recurrent neural network 

RNNs [13] differ from other neural networks in that they have an internal memory, which allows them
to store information associated with previous inputs. This allows an RNN to capture the sequential
properties of data and use them to predict what comes next, which is especially useful in NLP tasks, one
of the most important application areas of RNNs. An RNN can also use long short-term memory (LSTM)
cells to provide long-term memory (see Figure 3).

Figure 2. The convolutional neural network architecture

Figure 3. RNN architecture with LSTM cell


An RNN uses two inputs at each step: the current input and the output of the preceding step.
The decision is therefore always tied to the current input and to what the network has learned in the past.
The weights are updated either by gradient descent [22] or by backpropagation through time (BPTT) [23].
When an RNN has a large number of time steps, it is better to use gradient descent because it is less
computationally expensive than BPTT. RNNs are used in several NLP applications such as text generation,
machine translation, and text summarization. An RNN can also be used to predict the content of ancient
manuscripts that have lost some of their content, and may perform better than some methods that predict
handwritten words [24].
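
The recurrence described above, in which each step combines the current input with the previous output, can be sketched as follows. This is a minimal illustration assuming a single hidden unit with a tanh activation; the weights `w_x`, `w_h`, and `b` are arbitrary values chosen for the example, not parameters from any reviewed model.

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """New hidden state from the current input and the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Process a sequence step by step, carrying the hidden state forward."""
    h = 0.0  # the internal memory starts empty
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states

# The first input still influences the state two steps later,
# although the later inputs are zero: this is the "memory" of the RNN.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
```

The influence of the first input fades at every step in this simple recurrence, which is the limitation that LSTM cells are designed to reduce.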


2.4. Autoencoder 

An autoencoder [25] is a type of unsupervised neural network. It is characterized by the fact that the
target output is the same as the input data. An autoencoder consists of an encoder, a decoder, and a code.
The encoder and decoder are artificial neural networks (ANN), and the code is a single ANN layer that
summarizes the input data; it is also called the latent space representation.

Building an autoencoder requires an encoding method, a decoding method, and a loss function to
compare the output to the target. The structure of the decoder is often the mirror image of the encoder, as
shown in Figure 4, but this is not required. The only requirement is that the dimensions of the input and
the output are identical. There are several types of autoencoders, including convolutional autoencoders,
sparse autoencoders, and deep autoencoders [26]. Autoencoders can be used for tasks such as data
analysis, information retrieval, and keyphrase extraction.

Figure 4. Autoencoder architecture
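
As an illustration of the loss function mentioned above, one common choice is the mean squared reconstruction error. This is an assumption on our part, since the text does not fix a particular loss. Here $f$ denotes the encoder, $g$ the decoder, $x$ an input of dimension $n$, $z$ the code, and $\hat{x}$ the reconstructed output:

```latex
z = f(x), \qquad \hat{x} = g(z), \qquad
\mathcal{L}(x, \hat{x}) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2
```

Because $x$ and $\hat{x}$ are compared component by component, the input and output must have the same dimension, while the code $z$ is usually smaller.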


2.5. Deep belief networks 

The deep belief network (DBN) [27] is one of the most important types of unsupervised deep neural
networks. The DBN architecture consists of several stacked restricted Boltzmann machines (RBM) [28].
An RBM is a stochastic neural network consisting of a layer of visible units, v, and a layer of hidden
units, h. In a DBN, each layer is connected to the previous and next layers, so that it acts as a hidden
layer for the nodes that precede it and as an input layer for the nodes that follow. Figure 5 shows the
architecture of the DBN. DBNs are used in many tasks: they can reduce feature dimensions and recognize
images, and they can also be used for handwriting recognition and speech recognition.

Figure 5. Deep belief networks architecture


2.6. Activation function 

Artificial neural networks apply non-linear functions, called activation functions, to the weighted
sum computed in each neuron before its value is sent to the next layer. Generally, the role of an
activation function is to determine whether, and how strongly, a neuron responds; some of them also
bound the result. Several types of activation functions are used by DL algorithms. Table 1 presents the
most popular functions.
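
For reference, the activation functions named in this paper (sigmoid, tanh, ReLU, ELU, and softmax) can be written in a few lines. The sketch below uses their standard textbook definitions; the default `alpha=1.0` for ELU is a common convention and an assumption here.

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive values unchanged and clips negative values to 0."""
    return max(0.0, z)

def elu(z, alpha=1.0):
    """Like ReLU for z > 0; decays smoothly towards -alpha for negative z."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```

Sigmoid, tanh, ReLU, and ELU act on one neuron at a time, whereas softmax acts on a whole layer, which is why it typically appears at the output of a classifier.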


Table 1. Activation functions used by DL algorithms


3. KEYPHRASE EXTRACTION APPROACHES

Keyphrase extraction approaches vary depending on the techniques used. Nikzad-Khasmakhi et al. [29]
categorize them into textual, graph-based, and hybrid models, while Chi and Hu [30] classify them into
supervised, unsupervised, and deep learning approaches. In this section, we present methods for
extracting keyphrases that rely on deep learning.

We divide our presentation into two parts. The first covers methods that only extract
keyphrases mentioned in the text, while the second covers methods that also predict keyphrases that
are not mentioned in the text.


3.1. Keyphrase extraction

The approach proposed by Wang et al. [31] is the first to use deep learning techniques to extract
keyphrases from text. It is based on a neural network trained to determine whether a candidate phrase is a
keyphrase or not, using the values of four features: term frequency, inverse document frequency,
appearance in the document title, and frequency of appearance in the paragraphs. The disadvantage of this
method is that it gives all keyphrases the same importance, so we cannot select a specific number of the
most important keyphrases. To address this, Sarkar et al. [32] propose using a trained multilayer neural
network to rank the candidate phrases according to their probability of being keyphrases, which makes it
possible to choose the desired number of keyphrases. Other deep learning algorithms are also used, such
as the RNN, which Zhang et al. [33] exploited to propose an automatic keyphrase extraction (AKE) method
for tweets. The RNN model used has two hidden layers: the first identifies keyphrase information, while
the second extracts the keyphrase using a sequence labeling approach.
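
To make the feature-based formulation concrete, the sketch below computes four features of the kind listed above for one candidate phrase, and then selects the k most probable candidates from a scored list. It is an illustration under our own assumptions, not the implementation of [31] or [32]: matching is done by plain substring search, the IDF is smoothed, `corpus` is a hypothetical list of other documents, and the probabilities passed to `top_k` would come from a trained network that is not shown.

```python
import math

def phrase_features(phrase, title, paragraphs, corpus):
    """Four features for a candidate phrase: TF, IDF, title flag, paragraph frequency."""
    phrase = phrase.lower()
    text = " ".join(paragraphs).lower()
    tf = text.count(phrase)  # occurrences in the document (substring match)
    docs_with_phrase = sum(1 for doc in corpus if phrase in doc.lower())
    idf = math.log((1 + len(corpus)) / (1 + docs_with_phrase))  # smoothed IDF
    in_title = 1 if phrase in title.lower() else 0
    para_freq = sum(1 for p in paragraphs if phrase in p.lower())
    return [tf, idf, in_title, para_freq]

def top_k(scored_candidates, k):
    """Rank (phrase, probability) pairs and keep the k most probable phrases."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [phrase for phrase, _ in ranked[:k]]

title = "Deep learning for keyphrase extraction"
paragraphs = [
    "Keyphrase extraction selects important phrases.",
    "Deep learning improves keyphrase extraction.",
]
corpus = ["A study of image segmentation.", "Keyphrase extraction with graphs."]
feats = phrase_features("keyphrase extraction", title, paragraphs, corpus)
best = top_k([("dataset", 0.2), ("keyphrase extraction", 0.9), ("deep learning", 0.5)], 2)
```

Ranking by probability, rather than making a yes/no decision per phrase, is what allows the number of returned keyphrases to be chosen freely.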


3.2. Keyphrase generation 

Some methods attempt, in addition to extracting the keyphrases mentioned in the text, to predict
keyphrases that are not mentioned in the text. Meng et al. [34] propose a supervised approach to
predicting keyphrases based on an auto-encoder that captures the semantic meaning of the content using an
RNN. The approach compresses the original text into a hidden layer using an encoder and predicts
keyphrases using a decoder. However, this approach suffers from some problems, the most important of
which is the prediction of several keyphrases that express the same meaning, so that the extracted
keyphrases do not cover all the topics of the document.

To overcome these problems, Chen et al. [35] improved the previous approach with CorrRNN, which
predicts keyphrases that do not share the same meaning and that cover the topics of the document. However,
this improvement requires a large amount of labeled data for training. Ye and Wang [36] proposed a
method that reduces the amount of labeled data needed for training by using term frequency-inverse
document frequency (TF-IDF) and TextRank [37] to obtain the set of keyphrases used to train a multi-task
model. Wang et al. [38] also proposed a topic-based adversarial neural network (TANN) that uses both
labeled and unlabeled data to reduce the amount of labeled data used for training. Basaldella et al. [39]
argue that exploiting the preceding and following context of a given phrase can help predict keyphrases;
they therefore propose a bidirectional long short-term memory (BiLSTM) RNN to predict keyphrases.

Other methods combine deep learning with additional techniques such as conditional random fields
(CRF) [40] and sentence embedding [41]. Alzaidy et al. [42] propose combining BiLSTM and CRF: the
first captures the semantics of the phrase, and the second gives a probability distribution over the
phrase using the dependencies between the labels (keyphrase or non-keyphrase). Zhang and Xiao [43]
propose a model based on a seq2seq RNN, which can both extract the keyphrases present in the document
and predict others that are absent from it by capturing semantic, linguistic, and statistical
information. Santosh et al. [44] propose document-level attention for keyphrase extraction (DAKE), a
model that combines BiLSTM and CRF and is enhanced with document-level attention and a gating
mechanism to improve the extraction of keyphrases from scientific documents. Finally, Wu et al. [45]
propose a method that uses the keyphrases mentioned in the text to construct keyphrases not
mentioned in the text using a mask-predict method.
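
Methods that treat extraction as sequence labeling, such as the BiLSTM and CRF combinations above, assign one label per token and then read the keyphrases off the label sequence. The sketch below shows only this final decoding step. The B/I/O tag set (begin, inside, outside) is an assumption made for illustration; the reviewed papers may use a different labeling scheme, and the labels here are written by hand rather than predicted by a model.

```python
def decode_keyphrases(tokens, labels):
    """Group consecutive tokens tagged B (begin) / I (inside) into phrases."""
    phrases, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:  # a new phrase starts right after another one
                phrases.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = ["deep", "learning", "improves", "keyphrase", "extraction"]
labels = ["B", "I", "O", "B", "I"]
print(decode_keyphrases(tokens, labels))  # ['deep learning', 'keyphrase extraction']
```

A CRF layer on top of the BiLSTM scores whole label sequences, so it can penalize invalid transitions such as an "I" that directly follows an "O".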


3.3. Deep learning techniques for AKE 

The DL techniques used by the keyphrase extraction methods presented in the previous
subsections can be divided into two groups: traditional techniques, such as the multilayer feed-forward
neural network and the multilayer perceptron neural network, and modern techniques, such as encoders and
RNN variants, including LSTM, BiLSTM, and the bidirectional gated recurrent unit (BiGRU). Table 2 shows
the DL techniques used by each AKE method.
Table 2. DL techniques used by the AKE methods


Approach                 DL techniques                            Activation function       Type

Wang et al. [31]         Multilayer feed-forward neural network   Sigmoid                   Supervised backpropagation algorithm
Sarkar et al. [32]       Multilayer perceptron neural network     Sigmoid                   Supervised backpropagation algorithm
Zhang et al. [33]        RNN                                      Sigmoid, Softmax          Supervised stochastic gradient descent [46]
Meng et al. [34]         RNN                                      Sigmoid                   Supervised BiGRU [47]
Chen et al. [35]         RNN                                      Sigmoid, Softmax          Supervised BiGRU
Ye and Wang [36]         BiLSTM model, LSTM model                 Sigmoid, Tanh, Softmax    Semi-supervised self-learning algorithm [48]
Wang et al. [38]         BiLSTM network, CNN                      Sigmoid, Tanh, ELU        Supervised adversarial learning technique
Basaldella et al. [39]   BiLSTM RNN                               Softmax, Tanh, Sigmoid    Supervised root mean square propagation optimization algorithm [49]
Alzaidy et al. [42]      BiLSTM                                   Tanh, Sigmoid             Supervised stochastic gradient descent
Zhang and Xiao [43]      RNN, BiGRU, unidirectional GRU           Tanh, Sigmoid, Softmax    Supervised skip-gram [50]
Santosh et al. [44]      BiLSTM                                   Tanh, Sigmoid             Supervised Adam optimization method [49]
Wu et al. [45]           Prefix LM (encoder-decoder) [51]         Softmax                   Supervised multitask learning [52]


We also noted that only four activation functions are preferred by the AKE methods, namely sigmoid,
tanh, exponential linear unit (ELU), and softmax. These functions are suitable for the DL techniques used,
especially the RNN and its variants, which are used by 70% of the AKE methods studied, as shown in
Figure 6. The biggest problem with AKE methods that rely on DL techniques is that they are either
supervised or semi-supervised, and therefore require training datasets, which are not always available.

Figure 6. Percentage of use of DL techniques


4. EMPIRICAL RESULTS 
In this section, we describe the training and test datasets used by the studied AKE methods, as well as
the most commonly used evaluation metrics. We then discuss the results obtained by these methods.


4.1. Datasets 

Several datasets are used to train and evaluate AKE methods. Table 3 presents the datasets used by the
methods studied. Methods based on DL techniques require large datasets for training. Unfortunately,
before 2017 the largest dataset available contained only 2,304 scientific articles [53], which is
insufficient to train an RNN. With the construction of KP20K, which contains 527,430 documents, this
dataset became the favorite of the methods that appeared after 2017. In addition, most of the studied
methods relied on five datasets to evaluate their performance. There is also a recent dataset,
KPTimes [54], which provides 259,923 training documents but has not been used by the studied methods.


Table 3. Datasets used by the studied AKE methods


Dataset    Documents  Training documents  Test documents  Validation documents  Usage rate (%)

Inspec     2,000      1,000               500             500                   50
Krapivin   2,304      1,900               404             -                     42
NUS        211        -                   211             -                     50
SemEval    288        188                 100             -                     58
KP20K      527,430    527,030             20,000          20,000                58
KDD        755        -                   755             -                     8
WWW        1,330      -                   1,330           -                     8


4.2. Evaluation metrics

All the methods studied rely on three performance evaluation metrics: precision, recall, and
F1-score. They are defined as:

$$\mathrm{Precision} = \frac{\text{True Keyphrases}}{\text{True Keyphrases} + \text{False Keyphrases}} \tag{1}$$

$$\mathrm{Recall} = \frac{\text{True Keyphrases}}{\text{True Keyphrases} + \text{False Non-Keyphrases}} \tag{2}$$

$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3}$$


The results of these measures remain relative because they are affected by the number of
phrases extracted and by the nature and length of the document. Also, methods that do not predict phrases
absent from the document will obtain lower scores under these measures. It will therefore be necessary
to consider other ways of evaluating performance that go beyond these constraints.
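
The three metrics can be computed for a single document as sketched below. The sketch assumes exact matching after lowercasing, with no stemming, and an optional cutoff `k` so that the same function yields scores such as F@5 and F@10; published evaluations often apply additional normalization that is not reproduced here.

```python
def precision_recall_f1(predicted, gold, k=None):
    """Exact-match precision, recall, and F1 for the top-k predicted keyphrases."""
    if k is not None:
        predicted = predicted[:k]
    predicted_set = {p.lower() for p in predicted}
    gold_set = {g.lower() for g in gold}
    true_kp = len(predicted_set & gold_set)  # correctly predicted keyphrases
    precision = true_kp / len(predicted_set) if predicted_set else 0.0
    recall = true_kp / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

predicted = ["deep learning", "neural network", "keyphrase extraction", "dataset"]
gold = ["deep learning", "keyphrase extraction", "recurrent neural network"]
p, r, f1 = precision_recall_f1(predicted, gold)
p1, r1, f1_at_1 = precision_recall_f1(predicted, gold, k=1)
```

In this example two of the four predictions are correct, so precision is 0.5 and recall is 2/3. With k=1 the precision rises to 1.0 while the recall falls to 1/3, which shows how strongly the scores depend on the number of phrases extracted.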


4.3. Performance 


To evaluate the performance of the studied approaches, the authors relied on the evaluation metrics
discussed in the previous subsection, applying them to five datasets. Table 4 compares the average
performance of methods that extract only keyphrases found in the document with that of methods that also
predict keyphrases that are not mentioned in the document.


Table 4. Comparison of the KP extraction and KP generation methods


Dataset    F1-score  KP extraction  KP generation

Inspec     F@5       0.27           0.36
           F@10      0.21           0.33
Krapivin   F@5       0.12           0.22
           F@10      0.15           0.26
NUS        F@5       0.15           0.35
           F@10      0.19           0.41
SemEval    F@5       0.14           0.29
           F@10      0.17           0.31
KP20K      F@5       -              0.38
           F@10      -              0.34

These results show that methods that extract only the keyphrases mentioned in the
document perform less well than methods that also predict keyphrases not mentioned in the document. This
can be explained by the fact that most of the datasets used to evaluate AKE methods contain documents
whose reference keyphrases include phrases that do not appear in the document [55]. Figure 7 shows the
distribution of present and absent keyphrases in each dataset.


Figure 7. Percentage of present and absent keyphrases in datasets (Inspec, Krapivin, NUS, SemEval2010, and KP20K)
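
The distinction between present and absent keyphrases can be made operational with a simple check. The sketch below is a hedged illustration: it treats a reference keyphrase as present when it occurs verbatim in the lowercased document, whereas dataset statistics such as those reported above are often computed after stemming, which this example omits.

```python
def split_present_absent(document, gold_keyphrases):
    """Split reference keyphrases by whether they occur verbatim in the document."""
    text = " ".join(document.lower().split())  # lowercase and normalize whitespace
    present, absent = [], []
    for kp in gold_keyphrases:
        normalized = " ".join(kp.lower().split())
        (present if normalized in text else absent).append(kp)
    return present, absent

document = "Deep learning improves keyphrase extraction from documents."
gold = ["keyphrase extraction", "text summarization"]
present, absent = split_present_absent(document, gold)
absent_share = len(absent) / len(gold)
```

An extraction-only method can never return "text summarization" for this document, so its recall is capped by the share of present keyphrases.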


We then analyzed the performance according to the DL technique used, by calculating the average
performance of the studied methods that share the same technique. Table 5 shows the results obtained
for each dataset.

Although BiLSTM-based approaches require a large dataset for training, the results
presented in Table 5 clearly show that these methods perform better than the others. Therefore, we
recommend using the BiLSTM technique to extract and predict keyphrases from a document, especially
now that datasets are available for training and validation [54].


Table 5. Performance of each DL technique according to each dataset


Dataset    F1-score  MLP   Simple RNN  LSTM RNN  BiLSTM RNN  BiGRU RNN

Inspec     F@5       0.17  0.22        0.25      0.29        0.27
           F@10      0.21  0.24        0.24      0.31        0.30
Krapivin   F@5       0.15  0.19        0.29      0.32        0.31
           F@10      0.13  0.15        0.22      0.28        0.26
NUS        F@5       0.28  0.32        0.38      0.46        0.39
           F@10      0.30  0.31        0.33      0.41        0.32
SemEval    F@5       0.16  0.21        0.27      0.29        0.27
           F@10      0.13  0.18        0.30      0.32        0.32
KP20K      F@5       -     -           0.31      0.37        0.33
           F@10      -     -           0.33      0.35        0.29
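
As a quick check of the comparison in Table 5, the scores of each technique can be averaged over all reported cells. This summary is our own illustration and not a computation performed in the reviewed papers; cells marked "-" are skipped, so the MLP and simple RNN averages cover eight values instead of ten and are not strictly comparable with the others.

```python
# F1 scores copied from Table 5, in row order:
# Inspec, Krapivin, NUS, SemEval, KP20K (F@5 then F@10 for each dataset).
# None marks the cells reported as "-" in the table.
table5 = {
    "MLP":        [0.17, 0.21, 0.15, 0.13, 0.28, 0.30, 0.16, 0.13, None, None],
    "Simple RNN": [0.22, 0.24, 0.19, 0.15, 0.32, 0.31, 0.21, 0.18, None, None],
    "LSTM RNN":   [0.25, 0.24, 0.29, 0.22, 0.38, 0.33, 0.27, 0.30, 0.31, 0.33],
    "BiLSTM RNN": [0.29, 0.31, 0.32, 0.28, 0.46, 0.41, 0.29, 0.32, 0.37, 0.35],
    "BiGRU RNN":  [0.27, 0.30, 0.31, 0.26, 0.39, 0.32, 0.27, 0.32, 0.33, 0.29],
}

def mean_score(scores):
    """Average of the reported scores, ignoring missing cells."""
    reported = [s for s in scores if s is not None]
    return sum(reported) / len(reported)

averages = {name: mean_score(scores) for name, scores in table5.items()}
best = max(averages, key=averages.get)  # technique with the highest average
```

Under this simple averaging the BiLSTM column comes out highest, at about 0.34, which is consistent with the row-by-row reading of the table.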


5. DISCUSSION 

Recently, there has been a lot of interest in using deep learning techniques in several fields. In this
study, we reviewed deep learning techniques to show which techniques suit each domain and which can be
exploited in NLP tasks, especially the extraction of keyphrases from documents. When analyzing the
studied AKE methods, we found that the authors mostly limit themselves to RNNs and their variants,
because they are a good solution to sequential data problems such as speech and language
processing [56].

The results show that methods which extract only the keyphrases present in the text are less
effective than methods which also predict absent keyphrases. One reason for the superiority of the latter
is the evaluation method used, which is based on datasets in which about half of the keyphrases are not
mentioned in the documents. Thus, AKE models that only extract the keyphrases mentioned in the document
score lower. The empirical results also show that models based on BiLSTM have a higher extraction and
prediction ability than other techniques. On the other hand, training these models requires a large
amount of data. Therefore, we recommend the BiLSTM technique for extracting and predicting keyphrases
when datasets are available for training and validation.

CNNs are more efficient for data organized as matrices, such as images and videos, which explains
the dominance of this technique in computer vision tasks [16]. However, they have also performed well in
NLP tasks [57]. Since CNNs have a great capacity for classification, using them to predict keyphrases not
mentioned in the document may improve the performance of AKE approaches. Additionally, we encourage
researchers interested in keyphrase extraction and prediction to investigate deep learning techniques
further, especially with regard to the amount of input, the number of hidden layers, the discovery of
keyphrase features, loss functions, and activation functions. This should lead to better-performing
keyphrase extraction and prediction methods.


6. CONCLUSION 

Keyphrases are one of the solutions exploited to improve the performance of NLP tasks such as
information retrieval, summarization, classification, and document clustering. This article presents a
review of keyphrase extraction methods that use deep learning techniques, in order to show how deep
learning can be used in the extraction and prediction of keyphrases. Most AKE models that use DL
techniques choose the RNN or one of its variants, such as LSTM, BiLSTM, and BiGRU. Our review also
included an evaluation of the performance of the AKE methods studied. The results show that the BiLSTM
technique performed better than the other techniques, and that methods which predict absent keyphrases
performed better than methods which only extract the keyphrases mentioned in the document. Generally,
AKE methods based on deep learning perform better than other methods, especially unsupervised
methods; their weak point, however, is that they require a large amount of data for training
and validation. In the future, we will extend our study to develop an unsupervised system that takes
advantage of deep learning techniques and focuses on predicting keyphrases whether they are
present in or absent from documents.